DOMAIN: Industrial safety. NLP-based chatbot.

CONTEXT:

The database comes from one of the biggest industries in Brazil and in the world. There is an urgent need for industries and companies around the globe to understand why employees still suffer injuries and accidents in plants, and sometimes even die in such environments.

DATA DESCRIPTION:

The database consists of accident records from 12 different plants in 3 different countries, where every line in the data is one accident occurrence.

Columns description:

  • Data: timestamp or time/date information
  • Countries: which country the accident occurred (anonymised)
  • Local: the city where the manufacturing plant is located (anonymised)
  • Industry sector: which sector the plant belongs to
  • Accident level: from I to VI, it registers how severe the accident was (I means not severe, VI means very severe)
  • Potential Accident Level: depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)
  • Genre: whether the person is male or female
  • Employee or Third Party: if the injured person is an employee or a third party
  • Critical Risk: some description of the risk involved in the accident
  • Description: Detailed description of how the accident happened.

Link to download the dataset: https://www.kaggle.com/ihmstefanini/industrial-safety-and-health-analytics-database [ for your reference only ]

Step 0: Import Libraries

In [4]:
#!pip install contractions seaborn
In [1]:
!pip install wordcloud
Requirement already satisfied: wordcloud in c:\programs\anaconda3\envs\tf\lib\site-packages (1.8.2.2)
Requirement already satisfied: pillow in c:\programs\anaconda3\envs\tf\lib\site-packages (from wordcloud) (9.2.0)
Requirement already satisfied: numpy>=1.6.1 in c:\programs\anaconda3\envs\tf\lib\site-packages (from wordcloud) (1.22.4)
Requirement already satisfied: matplotlib in c:\programs\anaconda3\envs\tf\lib\site-packages (from wordcloud) (3.5.2)
Requirement already satisfied: python-dateutil>=2.7 in c:\programs\anaconda3\envs\tf\lib\site-packages (from matplotlib->wordcloud) (2.8.2)
Requirement already satisfied: packaging>=20.0 in c:\programs\anaconda3\envs\tf\lib\site-packages (from matplotlib->wordcloud) (21.3)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\programs\anaconda3\envs\tf\lib\site-packages (from matplotlib->wordcloud) (1.4.4)
Requirement already satisfied: fonttools>=4.22.0 in c:\programs\anaconda3\envs\tf\lib\site-packages (from matplotlib->wordcloud) (4.34.4)
Requirement already satisfied: pyparsing>=2.2.1 in c:\programs\anaconda3\envs\tf\lib\site-packages (from matplotlib->wordcloud) (3.0.9)
Requirement already satisfied: cycler>=0.10 in c:\programs\anaconda3\envs\tf\lib\site-packages (from matplotlib->wordcloud) (0.11.0)
Requirement already satisfied: six>=1.5 in c:\programs\anaconda3\envs\tf\lib\site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
In [1]:
import os
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from wordcloud import WordCloud, STOPWORDS
import random as python_random

from gensim.models import Word2Vec
from tqdm import tqdm

from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score, precision_score, classification_report, precision_recall_fscore_support, make_scorer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

from tensorflow import get_logger
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Flatten, Activation, Dense, LSTM, BatchNormalization, Embedding, Dropout, Bidirectional, GlobalMaxPool1D, Conv1D, MaxPooling1D
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model, to_categorical
from tensorflow.random import set_seed

# import lightgbm as lgb

from keras.callbacks import ReduceLROnPlateau, EarlyStopping, Callback
from keras.layers import Input
from keras.constraints import unit_norm
from keras.regularizers import l2

import missingno as mno
import holidays
from string import punctuation

import warnings
warnings.filterwarnings('ignore')

import contractions
import pickle

import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet, brown
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams
nltk.download('punkt')
nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')

from google.colab import drive, files
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prije\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\prije\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\prije\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\prije\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\prije\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\prije\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Out[1]:
True
In [198]:
from imblearn.over_sampling import RandomOverSampler
from sklearn import preprocessing
from tensorflow.keras.backend import clear_session
from tensorflow.keras.models import load_model
import joblib

Milestone 1

Step 1: Load the dataset

a) Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive/Capstone project")

b) List the files available on the drive

In [2]:
!ls
 Volume in drive D is DATA
 Volume Serial Number is 7042-8DAE

 Directory of D:\Prijesh\study\GreatLearningAIML\047-capstone-project\final

24-07-2022  07:40 PM    <DIR>          .
24-07-2022  07:40 PM    <DIR>          ..
24-07-2022  07:40 PM    <DIR>          .ipynb_checkpoints
24-07-2022  07:40 PM         6,993,322 Group13_NLP2_July21A_Capstone_Project (3).ipynb
10-06-2022  12:31 PM            35,695 IHMStefanini_industrial_safety_and_health_database.csv
10-06-2022  12:31 PM           193,631 IHMStefanini_industrial_safety_and_health_database_with_accidents_description.csv
               3 File(s)      7,222,648 bytes
               3 Dir(s)  20,169,076,736 bytes free

c) Read the dataset files

In [5]:
dataset1 = pd.read_csv('IHMStefanini_industrial_safety_and_health_database.csv')
dataset2 = pd.read_csv('IHMStefanini_industrial_safety_and_health_database_with_accidents_description.csv')

d) Print first 5 records of dataset1 & dataset2

In [6]:
dataset1.head()
Out[6]:
Data Countries Local Industry Sector Accident Level Potential Accident Level Genre Employee ou Terceiro Risco Critico
0 2016-01-01 00:00:00 Country_01 Local_01 Mining I IV Male Third Party Pressed
1 2016-01-02 00:00:00 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems
2 2016-01-06 00:00:00 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools
3 2016-01-08 00:00:00 Country_01 Local_04 Mining I I Male Third Party Others
4 2016-01-10 00:00:00 Country_01 Local_04 Mining IV IV Male Third Party Others
In [7]:
dataset2.head()
Out[7]:
Unnamed: 0 Data Countries Local Industry Sector Accident Level Potential Accident Level Genre Employee or Third Party Critical Risk Description
0 0 2016-01-01 00:00:00 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 1 2016-01-02 00:00:00 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2 2016-01-06 00:00:00 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...
3 3 2016-01-08 00:00:00 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C...
4 4 2016-01-10 00:00:00 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t...

e) Verify the columns of dataset1 & dataset2

In [8]:
dataset1.columns
Out[8]:
Index(['Data', 'Countries', 'Local', 'Industry Sector', 'Accident Level',
       'Potential Accident Level', 'Genre', 'Employee ou Terceiro',
       'Risco Critico'],
      dtype='object')
In [9]:
dataset2.columns
Out[9]:
Index(['Unnamed: 0', 'Data', 'Countries', 'Local', 'Industry Sector',
       'Accident Level', 'Potential Accident Level', 'Genre',
       'Employee or Third Party', 'Critical Risk', 'Description'],
      dtype='object')

f) Verify the shape of dataset1 & dataset2

In [10]:
dataset1.shape, dataset2.shape
Out[10]:
((439, 9), (425, 11))

We can conclude that dataset1 has 439 records and 9 columns, while dataset2 has 425 records and 11 columns.

g) Dataset finalization

Since this is an NLP problem and the Description field is mandatory, we can proceed further with dataset2. So we are finalizing dataset2 for further processing by assigning it to a variable called df.

In [11]:
df = dataset2.copy()
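The choice above hinges on Description being populated for every record. A minimal stdlib sketch of that sanity check on hypothetical rows (the field name matches the dataset; the rows themselves are made up):

```python
# Hypothetical records mirroring the dataset's shape; only 'Description' matters here
rows = [
    {"Accident Level": "I", "Description": "While removing the drill rod of the Jumbo 08 ..."},
    {"Accident Level": "IV", "Description": "   "},  # a blank text would be useless for NLP
]

# Keep only records carrying a non-empty Description
usable = [r for r in rows if r.get("Description", "").strip()]
print(len(usable))  # 1
```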

h) Verify data types

In [12]:
df.dtypes
Out[12]:
Unnamed: 0                   int64
Data                        object
Countries                   object
Local                       object
Industry Sector             object
Accident Level              object
Potential Accident Level    object
Genre                       object
Employee or Third Party     object
Critical Risk               object
Description                 object
dtype: object

Step 2: Data Cleansing

a) Remove irrelevant columns

We can remove the Unnamed: 0 column, as it contains only the index values.

In [13]:
df.drop("Unnamed: 0", axis=1, inplace=True)

b) Rename the columns

Let's rename the columns with meaningful names.

In [14]:
df.rename(columns={'Data':'Date', 'Countries':'Country', 'Genre':'Gender', 'Employee or Third Party':'Employee type'}, inplace=True)
df.head(3)
Out[14]:
Date Country Local Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description
0 2016-01-01 00:00:00 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 2016-01-02 00:00:00 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2016-01-06 00:00:00 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...

c) Verify duplicate data

Let's check for duplicate rows in the dataset.

In [15]:
df.duplicated().sum()
Out[15]:
7
In [16]:
duplicates = df.duplicated()

df[duplicates]
Out[16]:
Date Country Local Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description
77 2016-04-01 00:00:00 Country_01 Local_01 Mining I V Male Third Party (Remote) Others In circumstances that two workers of the Abrat...
262 2016-12-01 00:00:00 Country_01 Local_03 Mining I IV Male Employee Others During the activity of chuteo of ore in hopper...
303 2017-01-21 00:00:00 Country_02 Local_02 Mining I I Male Third Party (Remote) Others Employees engaged in the removal of material f...
345 2017-03-02 00:00:00 Country_03 Local_10 Others I I Male Third Party Venomous Animals On 02/03/17 during the soil sampling in the re...
346 2017-03-02 00:00:00 Country_03 Local_10 Others I I Male Third Party Venomous Animals On 02/03/17 during the soil sampling in the re...
355 2017-03-15 00:00:00 Country_03 Local_10 Others I I Male Third Party Venomous Animals Team of the VMS Project performed soil collect...
397 2017-05-23 00:00:00 Country_01 Local_04 Mining I IV Male Third Party Projection of fragments In moments when the 02 collaborators carried o...

d) Drop duplicate data

Let's drop the duplicate rows from the dataset.

In [17]:
df.drop_duplicates(inplace=True)
In [18]:
df.shape
Out[18]:
(418, 10)
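drop_duplicates() keeps the first occurrence of each repeated row by default (keep='first'), which is why the verbatim repeats listed above (e.g. rows 345 and 346) collapse into one. The same idea as a stdlib sketch on toy rows:

```python
# Toy (date, description) rows; the third is a verbatim repeat of the first
rows = [
    ("2017-03-02", "soil sampling incident"),
    ("2017-03-15", "soil collection incident"),
    ("2017-03-02", "soil sampling incident"),
]

# Keep the first occurrence of each row, mirroring DataFrame.drop_duplicates()
seen, deduped = set(), []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)
print(len(deduped))  # 2
```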

e) Print unique values for each field

Let's print the unique values for each field in the dataset (excluding the Description field).

In [19]:
for x in df.columns:
    if x != 'Description':
      print('--'*30); print(f'Unique values of "{x}" column'); print('--'*30)
      print(df[x].unique())
      print('\n')
------------------------------------------------------------
Unique values of "Date" column
------------------------------------------------------------
['2016-01-01 00:00:00' '2016-01-02 00:00:00' '2016-01-06 00:00:00'
 '2016-01-08 00:00:00' '2016-01-10 00:00:00' '2016-01-12 00:00:00'
 '2016-01-16 00:00:00' '2016-01-17 00:00:00' '2016-01-19 00:00:00'
 '2016-01-26 00:00:00' '2016-01-28 00:00:00' '2016-01-30 00:00:00'
 '2016-02-01 00:00:00' '2016-02-02 00:00:00' '2016-02-04 00:00:00'
 '2016-02-06 00:00:00' '2016-02-07 00:00:00' '2016-02-08 00:00:00'
 '2016-02-21 00:00:00' '2016-02-25 00:00:00' '2016-02-09 00:00:00'
 '2016-02-10 00:00:00' '2016-02-15 00:00:00' '2016-02-14 00:00:00'
 '2016-02-13 00:00:00' '2016-02-16 00:00:00' '2016-02-17 00:00:00'
 '2016-02-19 00:00:00' '2016-02-20 00:00:00' '2016-02-18 00:00:00'
 '2016-02-22 00:00:00' '2016-02-24 00:00:00' '2016-02-29 00:00:00'
 '2016-02-26 00:00:00' '2016-02-27 00:00:00' '2016-03-02 00:00:00'
 '2016-03-03 00:00:00' '2016-03-04 00:00:00' '2016-03-05 00:00:00'
 '2016-03-06 00:00:00' '2016-03-09 00:00:00' '2016-03-11 00:00:00'
 '2016-03-13 00:00:00' '2016-03-12 00:00:00' '2016-03-14 00:00:00'
 '2016-03-16 00:00:00' '2016-03-10 00:00:00' '2016-03-17 00:00:00'
 '2016-03-18 00:00:00' '2016-03-19 00:00:00' '2016-03-22 00:00:00'
 '2016-03-25 00:00:00' '2016-03-30 00:00:00' '2016-03-31 00:00:00'
 '2016-04-01 00:00:00' '2016-04-03 00:00:00' '2016-04-02 00:00:00'
 '2016-03-24 00:00:00' '2016-04-04 00:00:00' '2016-04-05 00:00:00'
 '2016-04-07 00:00:00' '2016-04-08 00:00:00' '2016-04-11 00:00:00'
 '2016-04-14 00:00:00' '2016-04-16 00:00:00' '2016-04-15 00:00:00'
 '2016-04-17 00:00:00' '2016-04-18 00:00:00' '2016-04-21 00:00:00'
 '2016-04-22 00:00:00' '2016-04-23 00:00:00' '2016-04-26 00:00:00'
 '2016-04-28 00:00:00' '2016-04-29 00:00:00' '2016-04-30 00:00:00'
 '2016-05-01 00:00:00' '2016-05-02 00:00:00' '2016-05-04 00:00:00'
 '2016-05-03 00:00:00' '2016-05-05 00:00:00' '2016-05-11 00:00:00'
 '2016-05-12 00:00:00' '2016-05-14 00:00:00' '2016-05-17 00:00:00'
 '2016-05-19 00:00:00' '2016-05-18 00:00:00' '2016-05-22 00:00:00'
 '2016-05-20 00:00:00' '2016-05-24 00:00:00' '2016-05-25 00:00:00'
 '2016-05-27 00:00:00' '2016-05-26 00:00:00' '2016-06-01 00:00:00'
 '2016-06-02 00:00:00' '2016-06-03 00:00:00' '2016-06-04 00:00:00'
 '2016-06-05 00:00:00' '2016-06-08 00:00:00' '2016-06-07 00:00:00'
 '2016-06-10 00:00:00' '2016-06-13 00:00:00' '2016-06-16 00:00:00'
 '2016-06-18 00:00:00' '2016-06-17 00:00:00' '2016-06-19 00:00:00'
 '2016-06-21 00:00:00' '2016-06-22 00:00:00' '2016-06-23 00:00:00'
 '2016-06-24 00:00:00' '2016-06-29 00:00:00' '2016-07-02 00:00:00'
 '2016-07-04 00:00:00' '2016-07-08 00:00:00' '2016-07-07 00:00:00'
 '2016-07-09 00:00:00' '2016-07-10 00:00:00' '2016-07-11 00:00:00'
 '2016-07-14 00:00:00' '2016-07-15 00:00:00' '2016-07-16 00:00:00'
 '2016-07-18 00:00:00' '2016-07-20 00:00:00' '2016-07-21 00:00:00'
 '2016-07-23 00:00:00' '2016-07-27 00:00:00' '2016-07-29 00:00:00'
 '2016-07-30 00:00:00' '2016-08-02 00:00:00' '2016-08-01 00:00:00'
 '2016-08-04 00:00:00' '2016-08-11 00:00:00' '2016-08-12 00:00:00'
 '2016-08-14 00:00:00' '2016-08-15 00:00:00' '2016-08-18 00:00:00'
 '2016-08-19 00:00:00' '2016-08-22 00:00:00' '2016-08-24 00:00:00'
 '2016-08-25 00:00:00' '2016-08-29 00:00:00' '2016-08-27 00:00:00'
 '2016-08-30 00:00:00' '2016-09-01 00:00:00' '2016-09-02 00:00:00'
 '2016-09-04 00:00:00' '2016-09-03 00:00:00' '2016-09-06 00:00:00'
 '2016-09-05 00:00:00' '2016-09-13 00:00:00' '2016-09-12 00:00:00'
 '2016-09-15 00:00:00' '2016-09-17 00:00:00' '2016-09-16 00:00:00'
 '2016-09-20 00:00:00' '2016-09-21 00:00:00' '2016-09-22 00:00:00'
 '2016-09-27 00:00:00' '2016-09-29 00:00:00' '2016-09-30 00:00:00'
 '2016-10-01 00:00:00' '2016-10-03 00:00:00' '2016-10-04 00:00:00'
 '2016-10-08 00:00:00' '2016-10-10 00:00:00' '2016-10-11 00:00:00'
 '2016-10-13 00:00:00' '2016-10-18 00:00:00' '2016-10-20 00:00:00'
 '2016-10-23 00:00:00' '2016-10-24 00:00:00' '2016-10-26 00:00:00'
 '2016-10-27 00:00:00' '2016-10-29 00:00:00' '2016-11-04 00:00:00'
 '2016-11-08 00:00:00' '2016-11-11 00:00:00' '2016-11-13 00:00:00'
 '2016-11-19 00:00:00' '2016-11-21 00:00:00' '2016-11-23 00:00:00'
 '2016-11-25 00:00:00' '2016-11-28 00:00:00' '2016-11-29 00:00:00'
 '2016-11-30 00:00:00' '2016-12-01 00:00:00' '2016-12-08 00:00:00'
 '2016-12-09 00:00:00' '2016-12-10 00:00:00' '2016-12-12 00:00:00'
 '2016-12-13 00:00:00' '2016-12-15 00:00:00' '2016-12-16 00:00:00'
 '2016-12-19 00:00:00' '2016-12-23 00:00:00' '2016-12-22 00:00:00'
 '2016-12-26 00:00:00' '2016-12-28 00:00:00' '2016-12-30 00:00:00'
 '2016-12-31 00:00:00' '2017-01-02 00:00:00' '2017-01-05 00:00:00'
 '2017-01-06 00:00:00' '2017-01-07 00:00:00' '2017-01-08 00:00:00'
 '2017-01-09 00:00:00' '2017-01-10 00:00:00' '2017-01-12 00:00:00'
 '2017-01-14 00:00:00' '2017-01-17 00:00:00' '2017-01-20 00:00:00'
 '2017-01-21 00:00:00' '2017-01-23 00:00:00' '2017-01-24 00:00:00'
 '2017-01-25 00:00:00' '2017-01-27 00:00:00' '2017-01-29 00:00:00'
 '2017-01-28 00:00:00' '2017-01-31 00:00:00' '2017-02-01 00:00:00'
 '2017-02-04 00:00:00' '2017-02-05 00:00:00' '2017-02-07 00:00:00'
 '2017-02-08 00:00:00' '2017-02-09 00:00:00' '2017-02-13 00:00:00'
 '2017-02-14 00:00:00' '2017-02-15 00:00:00' '2017-02-16 00:00:00'
 '2017-02-17 00:00:00' '2017-02-23 00:00:00' '2017-02-25 00:00:00'
 '2017-02-26 00:00:00' '2017-02-27 00:00:00' '2017-03-01 00:00:00'
 '2017-03-02 00:00:00' '2017-03-04 00:00:00' '2017-03-06 00:00:00'
 '2017-03-08 00:00:00' '2017-03-09 00:00:00' '2017-03-10 00:00:00'
 '2017-03-15 00:00:00' '2017-03-18 00:00:00' '2017-03-22 00:00:00'
 '2017-03-25 00:00:00' '2017-03-31 00:00:00' '2017-04-04 00:00:00'
 '2017-04-05 00:00:00' '2017-04-07 00:00:00' '2017-04-06 00:00:00'
 '2017-04-10 00:00:00' '2017-04-08 00:00:00' '2017-04-11 00:00:00'
 '2017-04-13 00:00:00' '2017-04-12 00:00:00' '2017-04-23 00:00:00'
 '2017-04-19 00:00:00' '2017-04-25 00:00:00' '2017-04-24 00:00:00'
 '2017-04-28 00:00:00' '2017-04-29 00:00:00' '2017-04-30 00:00:00'
 '2017-05-05 00:00:00' '2017-05-06 00:00:00' '2017-05-10 00:00:00'
 '2017-05-16 00:00:00' '2017-05-17 00:00:00' '2017-05-18 00:00:00'
 '2017-05-19 00:00:00' '2017-05-23 00:00:00' '2017-05-30 00:00:00'
 '2017-06-04 00:00:00' '2017-06-09 00:00:00' '2017-06-11 00:00:00'
 '2017-06-14 00:00:00' '2017-06-15 00:00:00' '2017-06-17 00:00:00'
 '2017-06-18 00:00:00' '2017-06-24 00:00:00' '2017-06-20 00:00:00'
 '2017-06-23 00:00:00' '2017-06-19 00:00:00' '2017-06-22 00:00:00'
 '2017-06-29 00:00:00' '2017-07-04 00:00:00' '2017-07-05 00:00:00'
 '2017-07-06 00:00:00' '2017-07-09 00:00:00']


------------------------------------------------------------
Unique values of "Country" column
------------------------------------------------------------
['Country_01' 'Country_02' 'Country_03']


------------------------------------------------------------
Unique values of "Local" column
------------------------------------------------------------
['Local_01' 'Local_02' 'Local_03' 'Local_04' 'Local_05' 'Local_06'
 'Local_07' 'Local_08' 'Local_10' 'Local_09' 'Local_11' 'Local_12']


------------------------------------------------------------
Unique values of "Industry Sector" column
------------------------------------------------------------
['Mining' 'Metals' 'Others']


------------------------------------------------------------
Unique values of "Accident Level" column
------------------------------------------------------------
['I' 'IV' 'III' 'II' 'V']


------------------------------------------------------------
Unique values of "Potential Accident Level" column
------------------------------------------------------------
['IV' 'III' 'I' 'II' 'V' 'VI']


------------------------------------------------------------
Unique values of "Gender" column
------------------------------------------------------------
['Male' 'Female']


------------------------------------------------------------
Unique values of "Employee type" column
------------------------------------------------------------
['Third Party' 'Employee' 'Third Party (Remote)']


------------------------------------------------------------
Unique values of "Critical Risk" column
------------------------------------------------------------
['Pressed' 'Pressurized Systems' 'Manual Tools' 'Others'
 'Fall prevention (same level)' 'Chemical substances' 'Liquid Metal'
 'Electrical installation' 'Confined space'
 'Pressurized Systems / Chemical Substances'
 'Blocking and isolation of energies' 'Suspended Loads' 'Poll' 'Cut'
 'Fall' 'Bees' 'Fall prevention' '\nNot applicable' 'Traffic' 'Projection'
 'Venomous Animals' 'Plates' 'Projection/Burning' 'remains of choco'
 'Vehicles and Mobile Equipment' 'Projection/Choco' 'Machine Protection'
 'Power lock' 'Burn' 'Projection/Manual Tools'
 'Individual protection equipment' 'Electrical Shock'
 'Projection of fragments']


f) Analyze missing values

Let's analyze the missing values in the dataset.

In [20]:
df.isnull().sum()
Out[20]:
Date                        0
Country                     0
Local                       0
Industry Sector             0
Accident Level              0
Potential Accident Level    0
Gender                      0
Employee type               0
Critical Risk               0
Description                 0
dtype: int64
In [21]:
mno.matrix(df, figsize = (10, 4));

Observation: The current dataset doesn’t have any missing values.
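No imputation is needed here, but if a categorical field such as Critical Risk ever arrived with gaps, one conservative option would be to fall back to the dataset's existing 'Others' bucket. A stdlib sketch on made-up records (the fill choice is an assumption, not something this dataset requires):

```python
# Made-up records; None stands in for a hypothetical missing Critical Risk value
records = [{"Critical Risk": "Cut"}, {"Critical Risk": None}, {"Critical Risk": "Fall"}]

# Map missing entries to the dataset's existing 'Others' category
for r in records:
    if r["Critical Risk"] is None:
        r["Critical Risk"] = "Others"
print([r["Critical Risk"] for r in records])  # ['Cut', 'Others', 'Fall']
```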

Step 3: Data Pre-processing for Visualization

Reusable functions for data preprocessing

In [22]:
def month2seasons(x):
    # Season mapping for the Southern Hemisphere (Brazil)
    if x in [9, 10, 11]:
        season = 'Spring'
    elif x in [12, 1, 2]:
        season = 'Summer'
    elif x in [3, 4, 5]:
        season = 'Autumn'
    else:  # months 6, 7, 8
        season = 'Winter'
    return season

def preprocess_data(df):
  df['Date'] = pd.to_datetime(df['Date'])
  df['Year'] = df['Date'].apply(lambda x : x.year)
  df['Month'] = df['Date'].apply(lambda x : x.month)
  df['Day'] = df['Date'].apply(lambda x : x.day)
  df['Weekday'] = df['Date'].apply(lambda x : x.day_name())
  df['WeekofYear'] = df['Date'].apply(lambda x : x.weekofyear)
  df['Season'] = df['Month'].apply(month2seasons)
  df['Is_Holiday'] = [1 if str(val).split()[0] in brazil_holidays else 0 for val in df['Date']]
  return df

def print_brazil_holidays(year):
  print('--'*40); print('List of Brazil holidays in ' + str(year)); print('--'*40)
  for date in holidays.Brazil(years = year).items():
      print(date)

def get_brazil_holidays(years):
  brazil_holidays = []
  for date in holidays.Brazil(years = years).items():
    brazil_holidays.append(str(date[0]))
  return brazil_holidays
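preprocess_data() derives Year, Month, Weekday and WeekofYear straight from the timestamp. The same extraction with only the standard library, for a timestamp that actually appears in the data (2016-01-06 is a Wednesday in ISO week 1):

```python
from datetime import datetime

# One timestamp from the dataset, in its 'YYYY-MM-DD HH:MM:SS' form
ts = datetime.fromisoformat("2016-01-06 00:00:00")
year, month, day = ts.year, ts.month, ts.day
weekday = ts.strftime("%A")          # day name, as used for the Weekday column
week_of_year = ts.isocalendar()[1]   # ISO week number
print(year, month, weekday, week_of_year)  # 2016 1 Wednesday 1
```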

Print Holidays in Brazil for the years 2016 and 2017

In [23]:
print_brazil_holidays(2016)
print_brazil_holidays(2017)
--------------------------------------------------------------------------------
List of Brazil holidays in 2016
--------------------------------------------------------------------------------
(datetime.date(2016, 1, 1), 'Ano novo')
(datetime.date(2016, 4, 21), 'Tiradentes')
(datetime.date(2016, 5, 1), 'Dia Mundial do Trabalho')
(datetime.date(2016, 9, 7), 'Independência do Brasil')
(datetime.date(2016, 10, 12), 'Nossa Senhora Aparecida')
(datetime.date(2016, 11, 2), 'Finados')
(datetime.date(2016, 11, 15), 'Proclamação da República')
(datetime.date(2016, 12, 25), 'Natal')
(datetime.date(2016, 3, 25), 'Sexta-feira Santa')
(datetime.date(2016, 3, 27), 'Páscoa')
(datetime.date(2016, 5, 26), 'Corpus Christi')
(datetime.date(2016, 2, 10), 'Quarta-feira de cinzas (Início da Quaresma)')
(datetime.date(2016, 2, 9), 'Carnaval')
--------------------------------------------------------------------------------
List of Brazil holidays in 2017
--------------------------------------------------------------------------------
(datetime.date(2017, 1, 1), 'Ano novo')
(datetime.date(2017, 4, 21), 'Tiradentes')
(datetime.date(2017, 5, 1), 'Dia Mundial do Trabalho')
(datetime.date(2017, 9, 7), 'Independência do Brasil')
(datetime.date(2017, 10, 12), 'Nossa Senhora Aparecida')
(datetime.date(2017, 11, 2), 'Finados')
(datetime.date(2017, 11, 15), 'Proclamação da República')
(datetime.date(2017, 12, 25), 'Natal')
(datetime.date(2017, 4, 14), 'Sexta-feira Santa')
(datetime.date(2017, 4, 16), 'Páscoa')
(datetime.date(2017, 6, 15), 'Corpus Christi')
(datetime.date(2017, 3, 1), 'Quarta-feira de cinzas (Início da Quaresma)')
(datetime.date(2017, 2, 28), 'Carnaval')

Get Holidays in Brazil for the years 2016 and 2017

In [24]:
brazil_holidays = get_brazil_holidays([2016, 2017])
In [25]:
brazil_holidays
Out[25]:
['2016-01-01',
 '2016-04-21',
 '2016-05-01',
 '2016-09-07',
 '2016-10-12',
 '2016-11-02',
 '2016-11-15',
 '2016-12-25',
 '2016-03-25',
 '2016-03-27',
 '2016-05-26',
 '2016-02-10',
 '2016-02-09',
 '2017-01-01',
 '2017-04-21',
 '2017-05-01',
 '2017-09-07',
 '2017-10-12',
 '2017-11-02',
 '2017-11-15',
 '2017-12-25',
 '2017-04-14',
 '2017-04-16',
 '2017-06-15',
 '2017-03-01',
 '2017-02-28']
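The Is_Holiday flag in preprocess_data() compares the date part of each timestamp (str(val).split()[0]) against this list of holiday strings. The matching step in isolation, on two dates from the data:

```python
# Holiday dates as strings, as returned by get_brazil_holidays() above
brazil_holidays = {"2016-01-01", "2017-04-21"}

# Timestamps in the dataset's 'YYYY-MM-DD HH:MM:SS' format
timestamps = ["2016-01-01 00:00:00", "2016-01-02 00:00:00"]

# split()[0] keeps only the date part before membership testing
flags = [1 if ts.split()[0] in brazil_holidays else 0 for ts in timestamps]
print(flags)  # [1, 0]
```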
In [26]:
df = preprocess_data(df)
In [27]:
df.head(3)
Out[27]:
Date Country Local Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday WeekofYear Season Is_Holiday
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 Friday 53 Summer 1
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 Saturday 53 Summer 0
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 Wednesday 1 Summer 0

Univariate Analysis

Reusable functions

In [28]:
def univariate_analysis(col, df, height=600, width=900):
  fig = make_subplots(rows=1, cols=2, specs=[[{"type": "xy"}, {"type": "domain"}]])

  labels = df[col].value_counts().index
  values = df[col].value_counts().values
  colors = px.colors.qualitative.Plotly + px.colors.qualitative.D3 + px.colors.qualitative.Vivid

  fig.add_trace(go.Bar(x=labels, y=values, name=col, marker=dict(color=colors), showlegend=False), row=1, col=1)
  fig.add_trace(go.Pie(labels=labels, values=values, name=col, marker=dict(colors=colors)), row=1, col=2)

  fig.update_layout(height=height, width=width, legend=dict(title=col))
  fig.show()
In [29]:
columns = df.drop(columns=['Date', 'Description', 'WeekofYear', 'Day']).columns

for col in columns:
    univariate_analysis(col, df)

Bivariate Analysis

Reusable functions
In [30]:
def bivariate_analysis(df, col, hue):
  fig = px.histogram(df, x=df[col], color=hue, width=800, height=400, title=f'{hue} vs {col} analysis')
  fig.show()
Gender vs other field Analysis
In [31]:
hue = 'Gender'
columns = ['Employee type', 'Country', 'Industry Sector', 'Is_Holiday', 'Accident Level', 'Weekday', 'Year', 'Season']
for col in columns:
    bivariate_analysis(df, col, hue)
Accident Level vs other field Analysis
In [32]:
hue = 'Accident Level'
columns = ['Employee type', 'Country', 'Industry Sector', 'Is_Holiday', 'Gender', 'Weekday', 'Year', 'Season', 'Local']
for col in columns:
    bivariate_analysis(df, col, hue)

Multivariate Analysis

In [33]:
def pre_process_for_ml(df):
    df['Country'] = df['Country'].replace({'Country_01': 1, 'Country_02': 2, 'Country_03': 3})
    df['Local'] = df['Local'].replace({'Local_01': 1, 'Local_02': 2, 'Local_03': 3, 'Local_04': 4, \
                                        'Local_05': 5, 'Local_06': 6, 'Local_07': 7, 'Local_08': 8, \
                                        'Local_09': 9, 'Local_10': 10,  'Local_11': 11, 'Local_12': 12})
    df['Industry Sector'] = df['Industry Sector'].replace({'Mining': 1, 'Metals': 2, 'Others': 3})
    df['Gender'] = df['Gender'].replace({'Male': 1, 'Female': 2})
    df['Employee type'] = df['Employee type'].replace({'Third Party': 1, 'Employee': 2, 'Third Party (Remote)': 3})
    df['Critical Risk'] = LabelEncoder().fit_transform(df['Critical Risk'])
    df['Year'] = df['Year'].replace({2016: 1, 2017: 2})
    df['Weekday'] = df['Weekday'].replace({'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4,\
                                          'Friday': 5, 'Saturday': 6, 'Sunday': 7})
    df['Season'] = df['Season'].replace({'Summer': 1, 'Autumn': 2, 'Winter': 3, 'Spring': 4})
    
    df['Accident Level'] = df['Accident Level'].replace({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5, 'VI': 6})
    df['Potential Accident Level'] = df['Potential Accident Level'].replace({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5, 'VI': 6})
    return df
In [34]:
new_df = pre_process_for_ml(df.copy())
In [35]:
plt.figure(figsize=(15,6))
sns.heatmap(new_df.corr(), annot=True, cmap='Blues')
plt.show()
In [36]:
df.describe().T.style.bar(
    subset=['mean'],
    color='Reds').background_gradient(
    subset=['std'], cmap='ocean').background_gradient(subset=['50%'], cmap='PuBu')
Out[36]:
  count mean std min 25% 50% 75% max
Year 418.000000 2016.322967 0.468170 2016.000000 2016.000000 2016.000000 2017.000000 2017.000000
Month 418.000000 5.267943 3.186449 1.000000 3.000000 5.000000 7.000000 12.000000
Day 418.000000 15.076555 8.618416 1.000000 8.000000 15.000000 22.000000 31.000000
WeekofYear 418.000000 21.033493 13.998418 1.000000 9.000000 18.000000 30.000000 53.000000
Is_Holiday 418.000000 0.023923 0.152994 0.000000 0.000000 0.000000 0.000000 1.000000

NLP Preprocessing

In [37]:
def preprocess_text(text):
    # Expand the contractions
    text = contractions.fix(text)

    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", "", text)

    # Remove HTML tags if any
    html = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    text = re.sub(html, "", text)

    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7f]', "", text)

    # Remove emojis
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  
        u'\U0001F300-\U0001F5FF'  
        u'\U0001F680-\U0001F6FF'  
        u'\U0001F1E0-\U0001F1FF'  
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # Remove all special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove unnecessary spaces
    text = text.strip()
    
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))  # build the set once instead of per token
    word_list = [w for w in tokens if w not in stop_words]
    return word_list

def stem_text(text):
    stemmer = PorterStemmer()
    stems = [stemmer.stem(i) for i in text]
    return stems

def extract_pos_tags(text):
    # Note: superseded by the backoff-tagger version defined below
    words = word_tokenize(text)
    tagged = nltk.pos_tag(words)
    return tagged

wordnet_map = {
    "N":wordnet.NOUN, 
    "V":wordnet.VERB, 
    "J":wordnet.ADJ, 
    "R":wordnet.ADV
}
    
train_sents = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

def extract_pos_tags(text, pos_tag_type="pos_tag"):
    pos_tagged_text = t2.tag(text)
    pos_tagged_text = [(word, wordnet_map.get(pos_tag[0], wordnet.NOUN)) for (word, pos_tag) in pos_tagged_text]
    return pos_tagged_text

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    lemma = [lemmatizer.lemmatize(word, tag) for word, tag in text]
    return lemma
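The regex-based cleaning steps above can be sketched in isolation. Below is a minimal illustration using only the standard library; the `clean_text` helper and the sample sentence are hypothetical, and whitespace is additionally collapsed here since, unlike `preprocess_text`, no tokenizer runs afterwards:

```python
import re

def clean_text(text):
    # Hypothetical helper mirroring the regex steps of preprocess_text above,
    # with one addition: whitespace is collapsed, since no tokenizer runs here
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # strip URLs
    text = re.sub(r"<.*?>", "", text)                   # strip HTML tags
    text = re.sub(r"[^\x00-\x7f]", "", text)            # strip non-ASCII
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)          # strip special characters
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.lower().strip()

sample = "Worker slipped <b>near</b> pump #3 — see https://example.com"
print(clean_text(sample))  # worker slipped near pump 3 see
```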
In [38]:
df['description_processed'] = df['Description'].apply(lambda t: ' '.join(preprocess_text(t)))
df['description_processed_stemmed'] = df['Description'].apply(lambda t: preprocess_text(t)).apply(lambda t: ' '.join(stem_text(t)))
df['description_processed_lemmatized'] =  df['Description'].apply(lambda t: preprocess_text(t)).apply(lambda t: extract_pos_tags(t)).apply(lambda t: ' '.join(lemmatize_text(t)))
In [39]:
df.head(3)
Out[39]:
Date Country Local Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday WeekofYear Season Is_Holiday description_processed description_processed_stemmed description_processed_lemmatized
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 Friday 53 Summer 1 removing drill rod jumbo 08 maintenance superv... remov drill rod jumbo 08 mainten supervisor pr... removing drill rod jumbo 08 maintenance superv...
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 Saturday 53 Summer 0 activation sodium sulphide pump piping uncoupl... activ sodium sulphid pump pipe uncoupl sulfid ... activation sodium sulphide pump pip uncoupled ...
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 Wednesday 1 Summer 0 substation milpo located level 170 collaborato... substat milpo locat level 170 collabor excav w... substation milpo locate level 170 collaborator...

Word Cloud

In [40]:
def draw_wordcloud(df, col, bigrams=True):
    text = " ".join(i for i in df[col])
    stopwords = set(STOPWORDS)
    wordcloud = WordCloud(stopwords=stopwords, background_color="whitesmoke", collocations=bigrams).generate(text)
    plt.figure( figsize=(15,10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
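Under the hood a word cloud is just a term-frequency ranking rendered graphically; the same information can be obtained numerically with a `Counter` (the toy corpus below is hypothetical):

```python
from collections import Counter

# A word cloud is a term-frequency ranking drawn graphically; Counter gives
# the same ranking numerically (toy corpus below is hypothetical)
corpus = ["employee hit hand", "operator hit hand", "employee slip"]
freqs = Counter(" ".join(corpus).split())
print(freqs.most_common(3))
```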
Word Cloud from description_processed_lemmatized field
In [41]:
draw_wordcloud(df, 'description_processed_lemmatized', bigrams=True)

Here we can see the most frequently used words in the description_processed_lemmatized field with bigrams enabled. A few of them are listed below:

  1. employee
  2. operator
  3. hit
  4. activity
  5. collaborator
In [42]:
draw_wordcloud(df, 'description_processed_lemmatized', bigrams=False)

Here we can see the most frequently used words in the description_processed_lemmatized field with bigrams disabled. A few of them are listed below:

  1. hand
  2. employee
  3. leave
  4. causing
  5. right

Save the cleaned dataset to CSV file

In [43]:
df.head(3)
Out[43]:
Date Country Local Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday WeekofYear Season Is_Holiday description_processed description_processed_stemmed description_processed_lemmatized
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 Friday 53 Summer 1 removing drill rod jumbo 08 maintenance superv... remov drill rod jumbo 08 mainten supervisor pr... removing drill rod jumbo 08 maintenance superv...
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 Saturday 53 Summer 0 activation sodium sulphide pump piping uncoupl... activ sodium sulphid pump pipe uncoupl sulfid ... activation sodium sulphide pump pip uncoupled ...
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6 Wednesday 1 Summer 0 substation milpo located level 170 collaborato... substat milpo locat level 170 collabor excav w... substation milpo locate level 170 collaborator...
In [3]:
df.to_csv('Accident_data_cleansed.csv', index=False) 
files.download('Accident_data_cleansed.csv')

MILESTONE 2

Read the input dataset generated as part of Milestone 1

In [45]:
df = pd.read_csv('Accident_data_cleansed.csv')
In [46]:
df.head(2)
Out[46]:
Date Country Local Industry Sector Accident Level Potential Accident Level Gender Employee type Critical Risk Description Year Month Day Weekday WeekofYear Season Is_Holiday description_processed description_processed_stemmed description_processed_lemmatized
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1 Friday 53 Summer 1 removing drill rod jumbo 08 maintenance superv... remov drill rod jumbo 08 mainten supervisor pr... removing drill rod jumbo 08 maintenance superv...
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2 Saturday 53 Summer 0 activation sodium sulphide pump piping uncoupl... activ sodium sulphid pump pipe uncoupl sulfid ... activation sodium sulphide pump pip uncoupled ...

Verify data imbalance

In [47]:
univariate_analysis('Accident Level', df)

Here we can see that the dataset is imbalanced when the Accident Level field is considered as the target variable.

  • 73.9% of data lies under Accident Level I
  • 9.57% of data lies under Accident Level II
  • 7.42% of data lies under Accident Level III
  • 7.18% of data lies under Accident Level IV
  • 1.91% of data lies under Accident Level V

Prepare train and test datasets with imbalanced data

Reusable Functions

In [48]:
def get_vocabularies(X):
    vocabulary = set()
    for description in X:
        words = word_tokenize(description)
        vocabulary.update(words)

    vocabulary = list(vocabulary)
    return vocabulary

Prepare X and y

In [49]:
X = df['description_processed_lemmatized']
y = df['Accident Level'].replace({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5})
In [50]:
print(X.shape, y.shape)
(418,) (418,)

Get Vocabularies

In [51]:
vocabulary = get_vocabularies(X.values)
print(len(vocabulary))
2975

Prepare Train and Test sets

In [52]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=7)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(334,) (334,) (84,) (84,)

Prepare TF-IDF Vectorizer

In [53]:
stop_words = stopwords.words('english') + list(punctuation)
vectorizer = TfidfVectorizer(stop_words=stop_words, tokenizer=word_tokenize, vocabulary=vocabulary)
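For reference, scikit-learn's TfidfVectorizer with default settings (smooth IDF) computes tf × (ln((1 + n) / (1 + df)) + 1) and L2-normalizes each row. A hand computation on a hypothetical two-document corpus:

```python
import math

# Hand computation of scikit-learn's default TF-IDF on a hypothetical
# two-document corpus: tf * (ln((1 + n) / (1 + df)) + 1), rows L2-normalized
docs = [["pump", "valve", "pump"], ["valve", "drill"]]
n = len(docs)

def idf(term):
    df = sum(term in d for d in docs)        # document frequency
    return math.log((1 + n) / (1 + df)) + 1  # smooth IDF

row = [docs[0].count(t) * idf(t) for t in ["drill", "pump", "valve"]]
norm = math.sqrt(sum(v * v for v in row))
row = [round(v / norm, 4) for v in row]
print(row)
```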

Fit the Vectorizer

In [54]:
vectorizer.fit(X_train)
Out[54]:
TfidfVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function word_tokenize at 0x000001D0FD7E8940>,
                vocabulary=['crown', 'pivot', 'purification', 'ordinary',
                            'bolt', 'applies', 'elevation', 'branch',
                            'assembling', 'roger', 'shaft', 'new', 'knuckle',
                            'respective', 'technical', 'a1', 'chagua', 'spare',
                            'spin', 'virdro', 'isidro', 'melting', 'thrust',
                            'ob1', 'mx12', 'teammate', 'sodium', 'sudden',
                            'scooptram', 'mount', ...])

Vectorize X_train and X_test

In [55]:
X_train_vec = vectorizer.transform(X_train)
X_test_vec = vectorizer.transform(X_test)
print(X_train_vec.shape, X_test_vec.shape)
(334, 2975) (84, 2975)

Prepare test and train datasets with over-sampled (balanced) data

Prepare X_o and y_o

In [56]:
X_o = df[['description_processed_lemmatized']]
y_o = df['Accident Level'].replace({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5})

Prepare Train and Test sets

In [57]:
X_train_o, X_test_o, y_train_o, y_test_o = train_test_split(X_o, y_o, test_size=0.2, random_state=7)

Prepare RandomOverSampler object

In [58]:
random_over = RandomOverSampler()

Analyze the target class counts before over-sampling

In [59]:
print("Before UpSampling, counts of label 'I': {}".format(sum(y_train_o==1)))
print("Before UpSampling, counts of label 'II': {}".format(sum(y_train_o==2)))
print("Before UpSampling, counts of label 'III': {}".format(sum(y_train_o==3)))
print("Before UpSampling, counts of label 'IV': {}".format(sum(y_train_o==4)))
print("Before UpSampling, counts of label 'V': {}".format(sum(y_train_o==5)))
Before UpSampling, counts of label 'I': 243
Before UpSampling, counts of label 'II': 32
Before UpSampling, counts of label 'III': 23
Before UpSampling, counts of label 'IV': 29
Before UpSampling, counts of label 'V': 7

Apply over-sampling

In [60]:
X_train_o, y_train_o = random_over.fit_resample(X_train_o, y_train_o.ravel())
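RandomOverSampler simply duplicates minority-class rows at random until every class matches the majority count. A pure-Python sketch of the idea (the `random_oversample` helper and toy data are hypothetical, not the imbalanced-learn implementation):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=7):
    # Hypothetical sketch of RandomOverSampler: duplicate minority-class rows
    # at random until every class matches the majority count
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for i in rng.choices(idx, k=target - count):
            X_res.append(X[i])
            y_res.append(label)
    return X_res, y_res

X_toy = ["a", "b", "c", "d", "e"]
y_toy = [1, 1, 1, 2, 3]
X_res, y_res = random_oversample(X_toy, y_toy)
print(Counter(y_res))  # each class duplicated up to 3 samples
```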

Analyze the target class counts after over-sampling

In [61]:
print("After UpSampling, counts of label 'I': {}".format(sum(y_train_o==1)))
print("After UpSampling, counts of label 'II': {}".format(sum(y_train_o==2)))
print("After UpSampling, counts of label 'III': {}".format(sum(y_train_o==3)))
print("After UpSampling, counts of label 'IV': {}".format(sum(y_train_o==4)))
print("After UpSampling, counts of label 'V': {}".format(sum(y_train_o==5)))
After UpSampling, counts of label 'I': 243
After UpSampling, counts of label 'II': 243
After UpSampling, counts of label 'III': 243
After UpSampling, counts of label 'IV': 243
After UpSampling, counts of label 'V': 243

Vectorize X_train_o and X_test_o

In [62]:
X_train_o_vect = vectorizer.transform(X_train_o['description_processed_lemmatized'].values)
X_test_o_vect = vectorizer.transform(X_test_o['description_processed_lemmatized'].values)

Machine learning classifiers

Reusable Functions

In [63]:
def get_ml_model_results(X_train, y_train, X_test, y_test):
    models = {
        'Multinomial NB': MultinomialNB(),
        'Logistic Regression': LogisticRegression(),
        'Gaussian NB': GaussianNB(),
        'KNN': KNeighborsClassifier(),
        'SVM': SVC(),
        'Decision Tree': DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=50, min_samples_leaf=7),
        'Random Forest': RandomForestClassifier(n_estimators=50, max_samples=7),
        'Bagging': BaggingClassifier(n_estimators=100, max_samples=10),
        'Ada Boost': AdaBoostClassifier(n_estimators=100),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, learning_rate=0.05),
        'RidgeClassifier': RidgeClassifier(random_state=1),
    }
    
    names = []
    prediction = []
    train_scores = []
    test_scores = []
    precision_scores = []
    recall_scores = []
    f1_scores = []
        
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        train_score = model.score(X_train, y_train)
        test_score = model.score(X_test, y_test)
        ps, rs, fs, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
        
        names.append(name)
        prediction.append(y_pred)
        train_scores.append(round(train_score * 100, 2))
        test_scores.append(round(test_score * 100, 2))
        precision_scores.append(round(ps * 100, 2))
        recall_scores.append(round(rs * 100, 2))
        f1_scores.append(round(fs * 100, 2))
    
    results = pd.DataFrame({
        'Model': names, 
        'Train Accuracy': train_scores, 
        'Test Accuracy': test_scores,
        'Precision': precision_scores,
        'Recall': recall_scores,
        'F1': f1_scores
    })
    
    return results

def get_ml_model(model, X_train, y_train):
    models = {
        'Multinomial NB': MultinomialNB(),
        'Logistic Regression': LogisticRegression(),
        'Gaussian NB': GaussianNB(),
        'KNN': KNeighborsClassifier(),
        'SVM': SVC(),
        'Decision Tree': DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=50, min_samples_leaf=7),
        'Random Forest': RandomForestClassifier(n_estimators=50, max_samples=7),
        'Bagging': BaggingClassifier(n_estimators=100, max_samples=10),
        'Ada Boost': AdaBoostClassifier(n_estimators=100),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, learning_rate=0.05),
        'RidgeClassifier': RidgeClassifier(random_state=1)
    }
        
    selected_model = models.get(model)
    if selected_model is None:
        print('Model not available')
        return None
    
    selected_model.fit(X_train, y_train)
    return selected_model

def visualize_model_score(result_df, col):
  result_df = result_df.sort_values(by=col, ascending=False)
  fig = px.bar(result_df, x='Model', y=col, color=col, \
               width=1200, height=500, text=[f'{str(i)} %' for i in result_df[col]],\
               color_continuous_scale='blugrn')
  fig.update_layout(title=f'{col} score of Models', title_x=0.5)
  fig.update_yaxes(range=[0, 100])
  fig.show()

Analyze Model scores with Imbalanced data

In [64]:
model_results = get_ml_model_results(X_train_vec.todense(), y_train, X_test_vec.todense(), y_test)
In [65]:
model_results.sort_values(by='Test Accuracy', ascending=False)
Out[65]:
Model Train Accuracy Test Accuracy Precision Recall F1
0 Multinomial NB 72.75 78.57 61.73 78.57 69.14
1 Logistic Regression 72.75 78.57 61.73 78.57 69.14
2 Gaussian NB 99.40 78.57 61.73 78.57 69.14
4 SVM 79.64 78.57 61.73 78.57 69.14
6 Random Forest 72.75 78.57 61.73 78.57 69.14
7 Bagging 72.75 78.57 61.73 78.57 69.14
10 RidgeClassifier 97.90 78.57 61.73 78.57 69.14
3 KNN 73.95 76.19 62.86 76.19 68.88
8 Ada Boost 74.25 76.19 61.32 76.19 67.95
9 Gradient Boosting 99.40 76.19 62.08 76.19 68.42
5 Decision Tree 76.35 71.43 61.22 71.43 65.93
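Most models tie at 78.57% test accuracy, which equals the majority-class baseline: always predicting level I scores exactly the level-I share of the 84 test rows. A quick check (the per-class test counts below are inferred from the class percentages and the pre-oversampling train counts, so treat them as an estimate rather than notebook output):

```python
# Majority-class baseline: always predicting level I scores the level-I share
# of the test split. Counts below are inferred estimates, not notebook output.
test_counts = {1: 66, 2: 8, 3: 8, 4: 1, 5: 1}
baseline = test_counts[1] / sum(test_counts.values())
print(round(baseline * 100, 2))  # 78.57
```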
In [66]:
visualize_model_score(model_results, 'Train Accuracy')
In [67]:
visualize_model_score(model_results, 'Test Accuracy')
In [68]:
visualize_model_score(model_results, 'Precision')
In [69]:
visualize_model_score(model_results, 'Recall')
In [70]:
visualize_model_score(model_results, 'F1')

Analyze the prediction capability of best performing model

Get the best performing ML model
In [71]:
best_ml_model = get_ml_model('Multinomial NB', X_train_vec.todense(), y_train)
Prepare the evaluation datasets for each target classes
In [72]:
I = df[df['Accident Level']=='I']
II = df[df['Accident Level']=='II']
III = df[df['Accident Level']=='III']
IV = df[df['Accident Level']=='IV']
V = df[df['Accident Level']=='V']
Predict the target class from the Description text
In [73]:
I_input = I['Description'].iloc[10]
best_ml_model.predict(vectorizer.transform([I_input]))
Out[73]:
array([1], dtype=int64)
In [74]:
II_input = II['Description'].iloc[10]
best_ml_model.predict(vectorizer.transform([II_input]))
Out[74]:
array([1], dtype=int64)
Conclusion

Since we used imbalanced data to create the model, the best performing model does not predict the minority classes correctly.

Analyze Model scores with balanced data

In [75]:
model_results_o = get_ml_model_results(X_train_o_vect.todense(), y_train_o, X_test_o_vect.todense(), y_test_o)
In [76]:
model_results_o.sort_values(by=['Train Accuracy', 'Test Accuracy'], ascending=False)
Out[76]:
Model Train Accuracy Test Accuracy Precision Recall F1
2 Gaussian NB 99.26 78.57 61.73 78.57 69.14
4 SVM 99.18 78.57 61.73 78.57 69.14
1 Logistic Regression 99.18 76.19 62.08 76.19 68.42
10 RidgeClassifier 99.18 76.19 67.42 76.19 70.18
9 Gradient Boosting 99.09 65.48 61.73 65.48 63.55
0 Multinomial NB 96.30 60.71 75.38 60.71 66.10
3 KNN 91.52 41.67 64.91 41.67 50.34
5 Decision Tree 82.63 54.76 64.97 54.76 59.38
7 Bagging 65.43 22.62 66.91 22.62 29.20
6 Random Forest 52.92 51.19 74.59 51.19 58.65
8 Ada Boost 28.48 8.33 0.81 8.33 1.48
In [77]:
visualize_model_score(model_results_o, 'Train Accuracy')
In [78]:
visualize_model_score(model_results_o, 'Test Accuracy')
In [79]:
visualize_model_score(model_results_o, 'Precision')
In [80]:
visualize_model_score(model_results_o, 'Recall')
In [81]:
visualize_model_score(model_results_o, 'F1')

Analyze the prediction capability of best performing model

Get the best performing ML model
In [82]:
best_ml_model_bal = get_ml_model('SVM', X_train_o_vect, y_train_o)
Predict the target class from the Description text
In [83]:
I_input = I['Description'].iloc[10]
best_ml_model_bal.predict(vectorizer.transform([I_input]))
Out[83]:
array([1], dtype=int64)
In [84]:
V_input = V['Description'].iloc[5]
best_ml_model_bal.predict(vectorizer.transform([V_input]))
Out[84]:
array([5], dtype=int64)
In [85]:
III_input = III['Description'].iloc[5]
best_ml_model_bal.predict(vectorizer.transform([III_input]))
Out[85]:
array([3], dtype=int64)
Conclusion

This time we used the balanced dataset to prepare the model, and we can see that it predicts well. Going forward we will build all models with the up-sampled (balanced) dataset, i.e. X_train_o, y_train_o, X_test_o and y_test_o.

Neural network classifiers

Reusable Functions

In [86]:
def reset_seeds(seed):
   np.random.seed(seed) 
   python_random.seed(seed)
   set_seed(seed)
In [87]:
def get_basic_nn_model(X_train):
  reset_seeds(0)
  clear_session()

  model = Sequential()
  model.add(Dense(64, input_shape=(X_train.shape[1], )))
  model.add(Activation('sigmoid'))
  model.add(Dense(24))
  model.add(Activation('sigmoid'))
  model.add(Dense(6))
  model.add(Activation('softmax'))
  model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
  return model

def get_nn_model_with_weight(X_train):
  reset_seeds(0)
  clear_session()

  model = Sequential()
  model.add(Dense(64, input_shape=(X_train.shape[1], ), kernel_initializer='he_normal'))
  model.add(Activation('sigmoid'))
  model.add(Dense(24, kernel_initializer='he_normal'))
  model.add(Activation('sigmoid'))
  model.add(Dense(6, kernel_initializer='he_normal'))
  model.add(Activation('softmax'))
  model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
  return model

def get_nn_model_with_relu(X_train):
  reset_seeds(0)
  clear_session()

  model = Sequential()
  model.add(Dense(64, input_shape=(X_train.shape[1], ), kernel_initializer='he_normal'))
  model.add(Activation('relu'))
  model.add(Dense(24))
  model.add(Activation('relu'))
  model.add(Dense(6))
  model.add(Activation('softmax'))
  model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
  return model

def get_nn_model_with_batch_normalization(X_train):
  reset_seeds(0)
  clear_session()

  model = Sequential()
  model.add(Dense(64, input_shape=(X_train.shape[1], ), kernel_initializer='he_normal'))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dense(24))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dense(6))
  model.add(Activation('softmax'))
  model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
  return model

def get_nn_model_with_dropout(X_train):
  reset_seeds(0)
  clear_session()

  model = Sequential()
  model.add(Dense(64, input_shape=(X_train.shape[1], ), kernel_initializer='he_normal'))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dropout(0.2))
  model.add(Dense(24))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dropout(0.2))
  model.add(Dense(6))
  model.add(Activation('softmax'))
  model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
  return model
In [88]:
def get_nn_model_results(X_train, y_train, X_test, y_test):
  models = {
      'Basic NN Model': get_basic_nn_model(X_train),
      'NN Model with Weight Initialization': get_nn_model_with_weight(X_train),
      'NN Model with Relu Activation': get_nn_model_with_relu(X_train),
      'NN Model with Batch Normalization': get_nn_model_with_batch_normalization(X_train),
      'NN Model with Dropout': get_nn_model_with_dropout(X_train)
  }

  names = []
  train_scores = []
  test_scores = []
  precision_scores = []
  recall_scores = []
  f1_scores = []

  for name, model in models.items():
    model.fit(X_train, y_train, epochs=100, batch_size=8, validation_data=(X_test, y_test), verbose=False)
    train_loss, train_score = model.evaluate(X_train, y_train, verbose=0)
    test_loss, test_score = model.evaluate(X_test, y_test, verbose=0)

    y_pred = model.predict(X_test, batch_size=64, verbose=0)
    y_pred_bool = np.argmax(y_pred, axis=1)
    y_test_bool = np.argmax(y_test, axis=1)
    ps, rs, fs, _ = np.average(precision_recall_fscore_support(y_test_bool, y_pred_bool), axis=1)

    names.append(name)
    train_scores.append(round(train_score * 100, 2))
    test_scores.append(round(test_score * 100, 2))
    precision_scores.append(round(ps * 100, 2))
    recall_scores.append(round(rs * 100, 2))
    f1_scores.append(round(fs * 100, 2))
    

  results = pd.DataFrame({
    'Model': names, 
    'Train Accuracy': train_scores, 
    'Test Accuracy': test_scores,
    'Precision': precision_scores,
    'Recall': recall_scores,
    'F1': f1_scores
  })

  return results

def get_nn_model(model, X_train, y_train):
  models = {
      'Basic NN Model': get_basic_nn_model(X_train),
      'NN Model with Weight Initialization': get_nn_model_with_weight(X_train),
      'NN Model with Relu Activation': get_nn_model_with_relu(X_train),
      'NN Model with Batch Normalization': get_nn_model_with_batch_normalization(X_train),
      'NN Model with Dropout': get_nn_model_with_dropout(X_train)
  }

  selected_model = models.get(model)
  if selected_model is None:
    print('Model not available')
    return None
    
  selected_model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=False)
  return selected_model
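Note that the NN metrics above use np.average over the per-class arrays from precision_recall_fscore_support, i.e. an unweighted macro average, while the ML models earlier used average='weighted'. On imbalanced test sets the two can differ substantially; a pure-Python illustration with hypothetical per-class recalls:

```python
# Macro vs weighted averaging, illustrated with hypothetical per-class recalls:
# an unweighted mean over per-class scores (the NN code above) treats classes
# equally, while average='weighted' (the ML code earlier) weights by support
per_class_recall = [0.9, 0.5]
support = [90, 10]

macro = sum(per_class_recall) / len(per_class_recall)
weighted = sum(r * s for r, s in zip(per_class_recall, support)) / sum(support)
print(macro, weighted)
```

This is worth keeping in mind when comparing the NN score tables against the ML score tables.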

Convert target classes to categories

In [89]:
y_train_o_cat = to_categorical(y_train_o, num_classes=None)
y_test_o_cat = to_categorical(y_test_o, num_classes=None)
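With labels 1 to 5 and num_classes=None, to_categorical infers max(label) + 1 = 6 classes, leaving column 0 unused; this is why every network above ends in a Dense(6) softmax layer. A minimal pure-Python stand-in (the `to_one_hot` helper is hypothetical, not the Keras implementation):

```python
def to_one_hot(labels):
    # Minimal pure-Python stand-in for keras.utils.to_categorical: with labels
    # 1..5 and num_classes=None it infers max(label) + 1 = 6 classes, so
    # column 0 stays unused, matching the Dense(6) output layers above
    n_classes = max(labels) + 1
    return [[1.0 if j == lab else 0.0 for j in range(n_classes)] for lab in labels]

print(to_one_hot([1, 5]))
```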

Analyze NN Model scores with balanced data

In [90]:
nn_model_results = get_nn_model_results(X_train_o_vect.todense(), y_train_o_cat, X_test_o_vect.todense(), y_test_o_cat)
In [91]:
nn_model_results.sort_values(by=['Train Accuracy', 'Test Accuracy'], ascending=False)
Out[91]:
Model Train Accuracy Test Accuracy Precision Recall F1
2 NN Model with Relu Activation 99.26 77.38 35.80 21.89 21.86
3 NN Model with Batch Normalization 99.26 75.00 15.95 19.09 17.38
4 NN Model with Dropout 99.26 75.00 16.15 19.09 17.50
1 NN Model with Weight Initialization 48.81 67.86 17.31 36.97 19.21
0 Basic NN Model 40.00 78.57 15.71 20.00 17.60
In [92]:
visualize_model_score(nn_model_results, 'Train Accuracy')
In [93]:
visualize_model_score(nn_model_results, 'Test Accuracy')
In [94]:
visualize_model_score(nn_model_results, 'Precision')
In [95]:
visualize_model_score(nn_model_results, 'Recall')
In [96]:
visualize_model_score(nn_model_results, 'F1')

Analyze the prediction capability of best performing model

Get the best performing NN model
In [97]:
best_nn_model = get_nn_model('NN Model with Relu Activation', X_train_o_vect.todense(), y_train_o_cat)
Predict the target class from the Description text
In [98]:
np.argmax(best_nn_model.predict(vectorizer.transform([I['Description'].iloc[13]]).todense()))
1/1 [==============================] - 0s 41ms/step
Out[98]:
1
In [99]:
np.argmax(best_nn_model.predict(vectorizer.transform([III['Description'].iloc[13]]).todense()))
1/1 [==============================] - 0s 18ms/step
Out[99]:
3
In [100]:
np.argmax(best_nn_model.predict(vectorizer.transform([V['Description'].iloc[7]]).todense()))
1/1 [==============================] - 0s 18ms/step
Out[100]:
5

Save the best performing model

In [101]:
best_nn_model.save('best_nn_model.h5')

Load back the best performing model

In [102]:
loaded_model = load_model('best_nn_model.h5')
Predict the target class from the Description text
In [103]:
np.argmax(loaded_model.predict(vectorizer.transform(['Hand cut off']).todense()))
1/1 [==============================] - 0s 45ms/step
Out[103]:
1
In [104]:
np.argmax(loaded_model.predict(vectorizer.transform([V['Description'].iloc[7]]).todense()))
1/1 [==============================] - 0s 17ms/step
Out[104]:
5

Conclusion

The best performing model is predicting well.

RNN or LSTM classifiers

Reusable Functions

In [170]:
def get_simple_lstm_model(max_len, top_words):
  clear_session()
  embedding_vector_length = 32

  model = Sequential()
  model.add(Embedding(top_words, embedding_vector_length, input_length=max_len))
  model.add(LSTM(100))
  model.add(Dense(6, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model

def get_lstm_with_dropout_model(max_len, top_words):
  clear_session()
  embedding_vector_length = 32

  model = Sequential()
  model.add(Embedding(top_words, embedding_vector_length, input_length=max_len))
  model.add(Dropout(0.2))
  model.add(LSTM(100))
  model.add(Dropout(0.2))
  model.add(Dense(6, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model

def get_bidirectional_lstm_model(max_len, top_words):
  clear_session()
  embedding_vector_length = 32

  model = Sequential()
  model.add(Embedding(top_words, embedding_vector_length, input_length=max_len))
  model.add(Dropout(0.2))
  model.add(Bidirectional(LSTM(100)))
  model.add(Dropout(0.2))
  model.add(Dense(6, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model

def get_lstm_and_cnn_model(max_len, top_words):
  clear_session()
  embedding_vector_length = 32

  model = Sequential()
  model.add(Embedding(top_words, embedding_vector_length, input_length=max_len))
  model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
  model.add(MaxPooling1D(pool_size=2))
  model.add(LSTM(100))
  model.add(Dense(6, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model

def get_rnn_lstm_model_results(X_train, y_train, X_test, y_test, max_len, top_words):
  models = {
      'Simple LSTM Model': get_simple_lstm_model(max_len, top_words),
      'LSTM with Dropout': get_lstm_with_dropout_model(max_len, top_words),
      'Bidirectional LSTM': get_bidirectional_lstm_model(max_len, top_words),
      'LSTM and CNN': get_lstm_and_cnn_model(max_len, top_words)
  }

  names = []
  train_scores = []
  test_scores = []
  precision_scores = []
  recall_scores = []
  f1_scores = []

  for name, model in models.items():
    print('Preparing model ', name)
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64, verbose=False)
    train_loss, train_score = model.evaluate(X_train, y_train, verbose=0)
    test_loss, test_score = model.evaluate(X_test, y_test, verbose=0)

    y_pred = model.predict(X_test, batch_size=64, verbose=0)
    y_pred_bool = np.argmax(y_pred, axis=1)
    y_test_bool = np.argmax(y_test, axis=1)
    ps, rs, fs, _ = np.average(precision_recall_fscore_support(y_test_bool, y_pred_bool), axis=1)

    names.append(name)
    train_scores.append(round(train_score * 100, 2))
    test_scores.append(round(test_score * 100, 2))
    precision_scores.append(round(ps * 100, 2))
    recall_scores.append(round(rs * 100, 2))
    f1_scores.append(round(fs * 100, 2))
    

  results = pd.DataFrame({
    'Model': names, 
    'Train Accuracy': train_scores, 
    'Test Accuracy': test_scores,
    'Precision': precision_scores,
    'Recall': recall_scores,
    'F1': f1_scores
  })

  return results

def get_rnn_lstm_model(model, X_train, y_train, max_len, top_words):
  models = {
      'Simple LSTM Model': get_simple_lstm_model(max_len, top_words),
      'LSTM with Dropout': get_lstm_with_dropout_model(max_len, top_words),
      'Bidirectional LSTM': get_bidirectional_lstm_model(max_len, top_words),
      'LSTM and CNN': get_lstm_and_cnn_model(max_len, top_words)
  }

  selected_model = models.get(model)
  if selected_model is None:
    print('Model not available')
    return None
    
  selected_model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=False)
  return selected_model

def prepare_input(text):
    input_seq = tokenizer.texts_to_sequences([text])
    input_pad = pad_sequences(input_seq, maxlen=max_len)
    return input_pad
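The tokenizer pipeline used below (texts_to_sequences followed by pad_sequences) can be mimicked in pure Python to show its semantics: known words map to integer indices, out-of-vocabulary words are dropped, and sequences are zero-padded at the front. The helpers below are simplified stand-ins, not the Keras implementations:

```python
def texts_to_seqs(texts, word_index):
    # Simplified stand-in for Tokenizer.texts_to_sequences: map known words to
    # their integer index, silently drop out-of-vocabulary words
    return [[word_index[w] for w in t.split() if w in word_index] for t in texts]

def pad_seqs(seqs, maxlen):
    # Simplified stand-in for pad_sequences defaults: zero-pad at the front,
    # keep only the last maxlen items of an over-long sequence
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]

word_index = {"operator": 1, "hit": 2, "hand": 3}
seqs = texts_to_seqs(["operator hit hand", "hand"], word_index)
print(pad_seqs(seqs, maxlen=4))  # [[0, 1, 2, 3], [0, 0, 0, 3]]
```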

Data Preparation

Here we reuse the already over-sampled train and test sets: X_train_o, X_test_o, y_train_o and y_test_o

Prepare Tokenizer
In [171]:
X = pd.concat([X_train_o, X_test_o])
In [172]:
top_words = 5000
tokenizer = Tokenizer(num_words=top_words)
tokenizer.fit_on_texts(X['description_processed_lemmatized'])
In [173]:
max_len = len(tokenizer.word_index)  # note: uses the vocabulary size, not the longest sequence, as the padding length
In [174]:
max_len
Out[174]:
2975
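To make the tokenize-and-pad steps below concrete, here is a toy re-implementation of what `Tokenizer`, `texts_to_sequences` and `pad_sequences` do (pure Python, illustrative only; the pipeline itself uses the Keras versions):

```python
from collections import Counter

def fit_word_index(texts):
    # Rank words by frequency, most frequent first (index 1 upward),
    # mirroring Tokenizer.word_index
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_sequences(texts, word_index):
    # Replace each word with its integer index
    return [[word_index[w] for w in text.lower().split()] for text in texts]

def pad(sequences, maxlen):
    # Left-pad with zeros (truncating from the front), like pad_sequences
    return [([0] * maxlen + seq)[-maxlen:] for seq in sequences]

toy_texts = ['worker slipped on wet floor', 'worker hand injury']
word_index = fit_word_index(toy_texts)   # {'worker': 1, 'slipped': 2, ...}
padded = pad(to_sequences(toy_texts, word_index), maxlen=6)
```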
Convert Text to Sequences
In [175]:
X_train_o_seq = tokenizer.texts_to_sequences(X_train_o['description_processed_lemmatized'])
X_test_o_seq = tokenizer.texts_to_sequences(X_test_o['description_processed_lemmatized'])
Convert Sequences to Padded Sequences
In [176]:
X_train_o_pad = pad_sequences(X_train_o_seq, maxlen=max_len)
X_test_o_pad = pad_sequences(X_test_o_seq, maxlen=max_len)
Convert y values to Categorical
In [177]:
y_train_o_cat = to_categorical(y_train_o, num_classes=None)
y_test_o_cat = to_categorical(y_test_o, num_classes=None)
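`to_categorical` converts the integer class labels into one-hot rows (one column per class); a minimal equivalent for illustration:

```python
def to_one_hot(labels, num_classes=None):
    # With num_classes=None, infer the class count from the largest label,
    # as keras.utils.to_categorical does
    n = num_classes if num_classes is not None else max(labels) + 1
    return [[1.0 if i == label else 0.0 for i in range(n)] for label in labels]

# Accident levels encoded as integers, e.g. 0..4
one_hot = to_one_hot([0, 2, 4])
# [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1]]
```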

Analyze RNN or LSTM Model scores with balanced data

In [178]:
rnn_lstm_model_results = get_rnn_lstm_model_results(X_train_o_pad, y_train_o_cat, X_test_o_pad, y_test_o_cat, max_len, top_words)
Preparing model  Simple LSTM Model
Preparing model  LSTM with Dropout
Preparing model  Bidirectional LSTM
Preparing model  LSTM and CNN
In [179]:
rnn_lstm_model_results.sort_values(by=['Train Accuracy', 'Test Accuracy'], ascending=False)
Out[179]:
   Model               Train Accuracy  Test Accuracy  Precision  Recall     F1
3  LSTM and CNN                 99.26          73.81      29.12   23.18  24.17
1  LSTM with Dropout            99.18          59.52      24.95   21.74  21.88
2  Bidirectional LSTM           91.85          41.67      21.26   23.79  19.40
0  Simple LSTM Model            86.67          41.67      36.64   32.50  17.83
In [180]:
visualize_model_score(rnn_lstm_model_results, 'Train Accuracy')
In [181]:
visualize_model_score(rnn_lstm_model_results, 'Test Accuracy')
In [182]:
visualize_model_score(rnn_lstm_model_results, 'Precision')
In [183]:
visualize_model_score(rnn_lstm_model_results, 'Recall')
In [184]:
visualize_model_score(rnn_lstm_model_results, 'F1')

Analyze the prediction capability of best performing LSTM model

Get the best performing LSTM model
In [185]:
best_lstm_model = get_rnn_lstm_model('LSTM and CNN', X_train_o_pad, y_train_o_cat, max_len, top_words)
Predict the target class with Description text
In [159]:
input_I = prepare_input(I['Description'].iloc[5])
np.argmax(best_lstm_model.predict(input_I))
1/1 [==============================] - 0s 76ms/step
Out[159]:
1
In [160]:
input_III = prepare_input(III['Description'].iloc[2])
np.argmax(best_lstm_model.predict(input_III))
1/1 [==============================] - 0s 77ms/step
Out[160]:
3
In [161]:
input_V = prepare_input(V['Description'].iloc[5])
np.argmax(best_lstm_model.predict(input_V))
1/1 [==============================] - 0s 73ms/step
Out[161]:
5

Save the best performing model

In [186]:
best_lstm_model.save('best_lstm_model.h5')
In [188]:
I = df[df['Accident Level']=='I']
II = df[df['Accident Level']=='II']
III = df[df['Accident Level']=='III']
IV = df[df['Accident Level']=='IV']
V = df[df['Accident Level']=='V']

Load back the best performing model

In [189]:
loaded_lstm_model = load_model('best_lstm_model.h5')
Predict the target class with Description text
In [154]:
input_I = prepare_input(I['Description'].iloc[5])
np.argmax(loaded_lstm_model.predict(input_I))
1/1 [==============================] - 0s 402ms/step
Out[154]:
1
In [155]:
input_III = prepare_input(III['Description'].iloc[2])
np.argmax(loaded_lstm_model.predict(input_III))
1/1 [==============================] - 0s 152ms/step
Out[155]:
3
In [156]:
input_V = prepare_input(V['Description'].iloc[5])
np.argmax(loaded_lstm_model.predict(input_V))
1/1 [==============================] - 0s 147ms/step
Out[156]:
5

Conclusion

The model re-loaded from disk returns the same accident levels (1, 3 and 5) as the in-memory model for the sample descriptions, so saving and loading preserves its predictions.

Choose the best performing model classifier and pickle it

Combine all the model results

In [190]:
all_model_scores = pd.concat([model_results_o, nn_model_results, rnn_lstm_model_results])
In [191]:
all_model_scores.sort_values(by=['Test Accuracy'], ascending=False)
Out[191]:
    Model                                Train Accuracy  Test Accuracy  Precision  Recall     F1
2   Gaussian NB                                   99.26          78.57      61.73   78.57  69.14
4   SVM                                           99.18          78.57      61.73   78.57  69.14
0   Basic NN Model                                40.00          78.57      15.71   20.00  17.60
2   NN Model with Relu Activation                 99.26          77.38      35.80   21.89  21.86
10  RidgeClassifier                               99.18          76.19      67.42   76.19  70.18
1   Logistic Regression                           99.18          76.19      62.08   76.19  68.42
4   NN Model with Dropout                         99.26          75.00      16.15   19.09  17.50
3   NN Model with Batch Normalization             99.26          75.00      15.95   19.09  17.38
3   LSTM and CNN                                  99.26          73.81      29.12   23.18  24.17
1   NN Model with Weight Initialization           48.81          67.86      17.31   36.97  19.21
9   Gradient Boosting                             99.09          65.48      61.73   65.48  63.55
0   Multinomial NB                                96.30          60.71      75.38   60.71  66.10
1   LSTM with Dropout                             99.18          59.52      24.95   21.74  21.88
5   Decision Tree                                 82.63          54.76      64.97   54.76  59.38
6   Random Forest                                 52.92          51.19      74.59   51.19  58.65
3   KNN                                           91.52          41.67      64.91   41.67  50.34
0   Simple LSTM Model                             86.67          41.67      36.64   32.50  17.83
2   Bidirectional LSTM                            91.85          41.67      21.26   23.79  19.40
7   Bagging                                       65.43          22.62      66.91   22.62  29.20
8   Ada Boost                                     28.48           8.33       0.81    8.33   1.48
In [192]:
visualize_model_score(all_model_scores, 'Train Accuracy')
In [193]:
visualize_model_score(all_model_scores, 'Test Accuracy')

Conclusion on model

Here we can see that the top performing models are:

  1. Gaussian NB
  2. SVM
  3. Basic NN Model

All three reach a Test Accuracy of 78.57% (though the Basic NN Model's train accuracy of only 40% suggests its test score may come largely from predicting the majority class).

The next best performing model is the NN Model with ReLU Activation, at 77.38% test accuracy.

Pickle the Best performing model for integrating with Chatbot

Get the best model

In [194]:
best_model = get_ml_model('SVM', X_train_o_vect, y_train_o)

Save the best model

In [197]:
joblib.dump(best_model, 'best_model.pkl')
Out[197]:
['best_model.pkl']

Load the saved model

In [200]:
loaded_best_model = joblib.load('best_model.pkl')

Predict Accident Level using Loaded best model

In [ ]:
II_input = II['Description'].iloc[10]
In [202]:
loaded_best_model.predict(vectorizer.transform([II_input]))
Out[202]:
array([2], dtype=int64)
In [203]:
IV_input = IV['Description'].iloc[10]
In [204]:
loaded_best_model.predict(vectorizer.transform([IV_input]))
Out[204]:
array([4], dtype=int64)
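For the chatbot integration, the pickled model's integer output still needs mapping back to the Roman-numeral accident levels. A small hedged helper (the name `level_from_prediction` is illustrative; the 1-based offset is inferred from the outputs above, where a level-II description yields array([2]) and a level-IV one yields array([4])):

```python
LEVELS = ['I', 'II', 'III', 'IV', 'V', 'VI']

def level_from_prediction(pred_class, offset=1):
    # The predictions above appear to use a 1-based integer encoding
    # (2 -> II, 4 -> IV), so the default offset is 1; adjust it if the
    # label encoding in your run differs.
    return LEVELS[int(pred_class) - offset]

# e.g. level_from_prediction(loaded_best_model.predict(...)[0]) -> 'II'
```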
In [ ]: